-
Notifications
You must be signed in to change notification settings - Fork 993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add total.label argument for groupingsets, cube, rollup #5973
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5973 +/- ##
==========================================
+ Coverage 97.51% 97.52% +0.01%
==========================================
Files 80 80
Lines 14915 14987 +72
==========================================
+ Hits 14544 14616 +72
Misses 371 371 ☔ View full report in Codecov by Sentry. |
As for the arg name I would keep it as "label" |
R/groupingsets.R
Outdated
@@ -57,6 +57,16 @@ groupingsets.data.table = function(x, j, by, sets, .SDcols, id = FALSE, jj, ...) | |||
stopf("Argument 'sets' must be a list of character vectors.") | |||
if (!is.logical(id)) | |||
stopf("Argument 'id' must be a logical scalar.") | |||
if (!(is.null(total.label) || | |||
(is.character(total.label) && length(total.label) == 1L) || | |||
(is.list(total.label) && all(vapply_1b(total.label, is.character)) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can rule out list because AFAIU char vector will be sufficient. Please correct me if I am wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jangorecki Thanks for reviewing. To make sure I understand you correctly, do you mean that total.label = list(Region = "National", Product = "Total")
should not be allowed, and it should instead be total.label = c(Region = "National", Product = "Total")
? I would have to try it to be sure, but I think that would be OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as long as each element in the list is scalar then list is not necessary and character vector is sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jangorecki Thanks, I'll try that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I was wrong and labels must not necessarily be of character type, is that correct? Then we need a list type so mixed types can be provided.
Would be good to use such little bit more complex example so it comes up straight away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jangorecki. Currently it only supports character and factor grouping variables. For other types, it's left as NA.
In #5351 you wrote
Then we of course want total.label="Total" to be recycled to list(State = "Total", Group = "Total"). Of course only if all columns are character.
I interpreted this as meaning the label argument should only apply to variables of type character, but now I realise that maybe I interpreted incorrectly and you might have meant that total.label = "Total"
should only be recycled to character variables, but non-character labels for non-character variables are still allowed in a list.
A couple of options (and there might be others):
- Only apply the label argument to grouping variables of type character or factor, and require
label = c(Region = "National", Product = "Total")
instead oflabel = list(Region = "National", Product = "Total")
. - Allow the label argument to apply to grouping variables that aren't character or factor, and allow total.label to be a list so that different types can be specified for different variables.
Which do you prefer? As I mentioned above, if I had a date or integer grouping variable, I would probably still want a character total label (e.g. "Total") rather than a date or integer total label. To do this, currently the user would have to convert the variable to character or factor before using groupingsets/cube/rollup.
Maybe for non-character grouping variables, we could allow a choice of specifying a label of the same type as the variable, or a character label, and if a character label is provided then the variable will be converted to character in the output. For example, if A
is character and B
is integer, then label = list(A = "LabelA", B = 999L)
and label(A = "LabelA", B = "LabelB")
would both be allowed, and the second would result in B being character in the output. If we do that, then I think I would favour having label = "Label"
apply to all variables, not just character variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing column type is not an option, column type needs to be as original columns used in 'by' and not in labels (so no for option B). Option 2. We cannot apply it to all variables because we must match column types, we don't just print results but return a data.table so saying it is only for output won't be useful. Our output is data.table and column types cannot depend on labels arg. It is the labels arg has to match to column types in the output 'by'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "output", I meant the data.table that is returned. Sorry for the confusion.
Thanks for clarifying that the column types have to stay the same. That makes it simpler.
Do you think the shorter form label = "Total"
should be allowed, or should label
always be a named list? If the shorter form is allowed, should it always have to be character, or would something like label = 999L
be allowed and apply to all the integer grouping variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be allowed but for non-char column will result an NA, which anyway is likely to be expected. Similarly 999L can be also allowed but would result NA for non int columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll give that a try.
@jangorecki Thanks for your comment. Do you mean change it from "total.label" to "label", or keep it as "total.label"? |
Yes, change to shorter "label" groupings sets are all about (sub-)totals so we don't need to be so explicitly in naming variable so verbosely. Possibly even better than "label" will he better to use "labels" so it is clear that multiple values are supported. |
Very good PR. What would be even more useful is minimal example usage in NEWS entry. Including different data types for labels. |
@jangorecki I prefer "total.label" or "total.labels" because they make it easier to guess where the label is going to be used, but I don't feel strongly about this, so I'm happy to change it to "labels". |
@jangorecki Thanks for the feedback. I can add an example to the NEWS entry. |
Another reason why I prefer single word argument name is it does not enforce any style, like total.label or totalLabel or total_label will do. |
OK, I can change it. Maybe "label" is better than "labels", since "labels" is a base function. |
@jangorecki I've updated the functions, tests, documentation, and news, based on our discussions. When you have time, please let me know what you think. I wasn't sure about the correct way to update a PR. Apologies if I've done it incorrectly. I don't understand the reason for the red cross next to the News commit. When I click on the cross, it says "R-CMD-check / windows-latest (devel) (pull_request)" was cancelled, but in the list of checks, it says that check was successful. Thanks. |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
@jangorecki any final review here? I can help merge to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haven't closely looked at the logic and test coverage but found few places which doesn't look perfect
Add 'label' argument to the groupingsets.data.table(), cube.data.table(), and rollup.data.table() functions.
Add to the Usage, Arguments, Details, and Examples sections.
@jangorecki Thanks for reviewing. I've made the changes you asked for. Regarding the code-quality / lint-r check, I fixed some of them but I hope it will be OK to leave long variable names because they're more descriptive that way. |
Please avoid force push when PR is already under reviews |
Thanks for approving and thanks for this feedback. I don't know what "force push" means but will try to find out so I can avoid doing them. |
Here's what we're seeing: Most of the time when this happens you get an error when trying |
One other note, I see this is your
|
(is.atomic(label) && length(label) == 1L) || | ||
(is.list(label) && all(vapply_1b(label, is.atomic)) && all(lengths(label) == 1L) && !is.null(names(label))))) | ||
stopf("Argument 'label', if not NULL, must be a scalar or a named list of scalars.") | ||
if (is.list(label) && !is.null(names(label)) && ("" %chin% names(label) || anyNA(names(label)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this !is.null(names(label))
check redundant? Since we have is.list(label) && ... !is.null(names(label))
in the above requirement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do find it a bit surprising that the above check requires "a named list of scalars" but we have a separate test for "all list elements must be named", maybe best to add in the check for ""
/NA
names to the above condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can address this as a small follow-up PR if you agree, don't want to hold the PR back further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this
!is.null(names(label))
check redundant?
Yes, I think you're right.
I do find it a bit surprising that the above check requires "a named list of scalars" but we have a separate test for "all list elements must be named", maybe best to add in the check for
""
/NA
names to the above condition?
Maybe, but with separate checks the error messages can be more specific, the second one being for the situation where label
is a named list but not all elements have a name. If we combine the error messages into one, it would be something like "Argument 'label', if not NULL, must be (1) a scalar, or (2) a named list with each element being named and each element being a scalar." Or "Argument 'label', if not NULL, must be (1) a scalar, or (2) a named list with no names being "" or NA and each element being a scalar." I think these are less clear and less helpful than having separate error messages depending on the situation. If you disagree, please let me know. It's not something I feel strongly about.
Honestly I think it's overdoing it here. Those variables have very limited scope for an error branch. I just used simple names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks!
@MichaelChirico Thanks for checking and improving the code. I'll study your revised code and learn from it.
Thanks for the explanation. I see that too. I used the GitHub GUI in the browser rather than command-line git (I haven't learned command-line git yet). I didn't explicitly select "force push", but just synced my fork and then made the changes, and GitHub automatically did the force push. Maybe I should have just changed the files without syncing the fork first.
Thanks, I'll keep that in mind.
Yes, this PR was started before I was a member. |
Closes #5351
Added a
total.label
argument to therollup.data.table
,cube.data.table
, andgroupingsets.data.table
functions.A couple of comments:
all.label
might be a better name thantotal.label
, since the thing being calculated won't always be a total. But I left it astotal.label
because the documentation uses the terminology "(sub-)totals", and the word "total" is used in other software, so "total" seems to be widely accepted.